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^ Abstract 

^ Transforming a matrix over a field to echelon form, or decomposing the ma- 

, ^ , trix as a product of structured matrices that reveal the rank profile, is a fun- 

damental building block of computational exact linear algebra. This paper 
^ surveys the well known variations of such decompositions and transforma- 

tions that have been proposed in the literature. We present an algorithm 
^ to compute the CUP decomposition of a matrix, adapted from the LSP algo- 

v/~) rithm of Ibarra, Moran and Hui (1982), and show reductions from the other 

^ most common Gaussian elimination based matrix transformations and de- 

^ compositions to the CUP decomposition. We discuss the advantages of the 

^ CUP algorithm over other existing algorithms by studying time and space 

complexities: the asymptotic time complexity is rank sensitive, and compar- 
ing the constants of the leading terms, the algorithms for computing matrix 
invariants based on the CUP decomposition are always at least as good except 
^ in one case. We also show that the CUP algorithm, as well as the computation 

of other invariants such as transformation to reduced column echelon form 
using the CUP algorithm, all work in place, allowing for example to compute 
the inverse of a matrix on the same storage as the input matrix. 
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1. Introduction 



Gaussian elimination and the corresponding matrix decompositions, such 
as the LU decomposition A = LU as the product of lower triangular L and 
upper triangular U, are fundamental building blocks in computational linear 
algebra that are used to solve problems such as linear systems, computing 
the rank, the determinant and a basis of the nuUspace of a matrix. The LU 
decomposition, which is defined for matrices whose leading principal minors 
are all nonsingular, can be generalized to all nonsingular matrices by intro- 
ducing pivoting on one side (rows or columns), leading to the LUP or PLU 
decomposition, P a permutation matrix. Matrices with arbitrary dimensions 
and rank can be handled by introducing pivoting on both sides, leading to the 



LQUP decomposition of Ibarra et al. ( 1982 ) or the PLUQ decomposition ( Golub 



and Van Loan, 1996 Jeffrey, 2010), Q a second permutation matrix. We re- 



call the precise definitions of these decompositions in Section [2j For now, we 
note that the decompositions are not unique. In particular, they can differ 
depending on the pivoting strategy used to form the permutations P and Q. 



Whereas in numerical linear algebra (Golub and Van Loan 1996) pivoting 



is used to ensure a good numerical stability, good data locality, and reduce 
the fill-in, the role of pivoting differs in the context of exact linear algebra. 
Indeed, only certain pivoting strategies for these decompositions will reveal 
the rank profile of the matrix (see Section [2]), which is crucial information 
in many applications using exact Gaussian elimination, such as Grobner ba- 



2007). 



sis computations (Faugere, 1999) and computational number theory (Stein 



In this article we consider matrices over an arbitrary field and analyze 
algorithms by counting the required number of arithmetic operations from 
the field. Many computations over Z or Q (including polynomials with co- 
efficients from these rings) reduce to linear algebra over finite fields using 
techniques such as linearization or homomorphic imaging (e.g., reduction 
modulo a single prime, multi-modular reduction, p-adic lifting). Applica- 
tions such as cryptanalysis make intensive use of linear algebra over finite 
fields directly. Unlike arithmetic operations involving integers, the cost of an 
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operation in a finite field does not depend on the value of its operands, and 
counting the number of field operations is indicative of the actual running 
time. 

The asymptotic time complexity of linear algebra of matrices over a field 
has been well studied. While Gaussian elimination can be performed on an 
n X n matrix using 0{n^) arithmetic operations, a subcubic complexity of 
O(n^) with u < 2.38 the exponent of matrix multiplication (see 



Gathen and Gerhard, 2003, §12.1) is obtained using Bunch and Hopcroft 



von zur 



(1974) block recursive algorithm for the LUP decomposition. Algorithms for 



computing the LSP and LQUP decompositions of rank deficient matrices are 



given by Ibarra et al. (1982). As an alternative to decomposition, Keller 



Gehrig (1985) adapts some earlier work by Schonhage (1973) to get an algo- 
rithm that computes a nonsingular transformation matrix X such that AX 
is in column echelon form. For an m x matrix such that m < n, all of the 
Gaussian elimination based algorithms just mentioned have an 0{nm'^~^) 
time complexity. 

Only recently has the complexity of computing a rank profile revealing de- 
composition or a transformation to echelon form been studied in more details: 
algorithms with a rank-sensitive time complexity of 0{mnr^~'^) for com- 
puting a transformation to echelon and reduced echelon forms are given by 



Storjohann and Mulders (1998) and Storjohann (2000); similar rank-sensitive 



time complexities for the LSP and LQUP decompositions of Ibarra et al. ( 1982 ) 



have been obtained by Jeannerod (2006). On the one hand, while offering a 



rank-sensitive time complexity, the algorithms in (Storjohann and Mulders 



1998 


Storjohann 


2000 


Jeannerod , 


2006) 



3 not necessarily in-place. On 
the other hand, Dumas et al. (2008) describe an in-place variant of the LQUP 
decomposition but, while analyzing the constant in the leading term, does 
not achieve a rank-sensitive complexity. 

The aim of this article is to gather, generalize, and extend these more 
refined analyses to the most common Gaussian elimination based matrix 
decompositions. We propose an algorithm computing a new matrix decom- 
position, the CUP decomposition, and show reductions to it for all common 
matrix decompositions. We assess that, among all other matrix decomposi- 
tions, the algorithm for CUP is the preferred Gaussian elimination algorithm 
to be used as a building block routine for the following reasons. 

1. We show how all other decompositions can be recovered from the CUP 
decomposition. Furthermore, the time complexities for all algorithms 



3 



computing the alternative decompositions and transformations (con- 
sidering the constant in the leading term) are never better than the 
proposed reduction to CUP, except by a slight amount in one case. 

2. The complexity of the algorithm for CUP is rank sensitive (whereas no 
such result could be produced with some of the other algorithms). 

3. The reduction to CUP allows us to perform these computations in-place, 
that is, with essentially no more memory than what the matrix products 
involved already use. 

4. The CUP decomposition offers the best modularity: combined with other 
classic routines like TRSM, TRTRI, TRULM and TRLUM (which shall be 
recalled in the course of the article), it allows to compute solutions 
to other linear algebra problems such as rank, rank profile, nullspace, 
determinant and inverse with the best time complexities. 

1.1. Notation and definitions 

Matrix entries are accessed using zero-based indexing: for example, ao,i 
denotes the entry lying on the first row and second column of the matrix 
A = [ttiA. In order to make the description of our algorithms simpler, we 



adopt the following convention (Pilgrim, 2004): intervals of indices include 



the lower bound but exclude the upper bound, that is, 

a..b = {a, a + l,...,6 — 1}. 

Our algorithms make heavy use of conformal block decompositions of matri- 
ces. In the literature, new variable names are typically used for each block. 
Instead, we refer to blocks using intervals of indices in order to emphasize 
the fact that no memory is being allocated and that our algorithms actually 
work in-place. The notation A'^''l represents the submatrix of A formed by 
the intersection of rows of index from a to 6 — 1, and the columns of index 
from c to (i — 1. For example, a 2 x 2 block decomposition of an m x n matrix 
A shall be referred to as 



" AO..e 

^Q..k 


Ai-.n 
^O..k 


AO.I 
^k..m 


A^-n 
k..m _ 



A 



for suitable integers k and Similarly, A^ -'^ denotes the subrow of row a of 
A whose column indices range from do d — 1. 

Following Ibarra et al. (1982), we say a rectangular matrix A = [ajj] is 
upper triangular if = for i > j, and that it is lower triangular if its 
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transpose is upper triangular. A matrix that is either upper or lower 
triangular is simply called triangular, and if in addition ai^i = 1 for all i 
then it is called unit triangular. For two m x n matrices L and U such that 
L = [iij] is unit lower triangular and U = [uij] is upper triangular, we shall 
use the notation to express the fact that L and U are stored one next 

to the other within the same m x n matrix. Thus, A = [L\U] is the m x n 
matrix A = [aij] such that aij = iij ii i > j, and = Uij otherwise. 

For a permutation a : {0, ... ,n — 1} — )■ {0, . . . , ri — 1}, the associated 
permutation matrix P = [Pi,j]o<i,j<n is defined by -Pi,cr(i) = 1 and Pij = for 
j 7^ cr(i). Multiplying a matrix on the left by P applies the permutation a 
to the rows of that matrix, while multiplying on the right by the transpose 
P^ = P~^ applies a to its columns. We denote by Tjj the permutation 
matrix that swaps indices i and j and leaves the other elements unchanged. 

1.2. Organization of the article 

Section |2] reviews and exposes the links between the most common ma- 
trix decompositions originating from Gaussian elimination: LU, LUP, Turing- 
Decomposition, LSP, LQUP, and QLUP, together with our variant of the LSP 
decomposition, the CUP decomposition. Section [3] gives algorithms to com- 
pute all these matrix decompositions with no extra memory allocation. More 
precisely, we give an algorithm for computing a CUP decomposition and show 
how all of the other decompositions can be derived from it. The advantages of 
the CUP algorithm compared to other existing algorithms for Gaussian elim- 
ination are discussed in Section |4| based on an analysis of time and space 
complexities. In Section [5] we comment on the PLE decomposition, which is 
the row-counterpart of the CUP decomposition, and we conclude in Section [6] 

2. Review of Gaussian elimination based matrix decompositions 

Throughout this section, let A be an m x n matrix with rank r. The row 
rank profile of A is the lexicographically smallest sequence of r row indices 
Iq < ii < ■ ■ ■ < ir-i such that the corresponding rows of A are linearly 
independent. The matrix A is said to have generic rank profile if its first r 
leading principal minors are nonzero. 

2.1. LU based decompositions 

If A has generic rank profile it has a unique LU decomposition: A = LU 
for L an m X m unit lower triangular matrix with last m — r columns those 
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of the identity, and U anmxn upper triangular matrix with last m — r rows 
equal to zero. In our examples, zero entries of a matrix are simply left blank, 
possibly nonzero entries are indicated with *, and necessarily nonzero entries 
with *. 

Example 1. The LU decomposition of a 6 x 4 matrix A with rank 3. 
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* * 


*- 




■1 








* * * * 
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* * 


* 
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* * * 
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* * 


* 
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* * 
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* * 


* 
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* 1 
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* * 


* 
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* 1 








* * 
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* 1 







Uniqueness of U and the first r rows of L follows from the generic rank 
profile condition on A, while uniqueness of the last m — r rows of L follows 
from the condition that L has last m — r columns those of the identity ma- 
trix. If A has row rank profile [0, 1, . . . , r — 1], then it only takes a column 
permutation to achieve generic rank profile; for such input matrices, Aho 



et al. (1974, §6.4) and Bunch and Hopcroft (1974) give a reduction to ma- 
trix multiplication for computing an LUP decomposition: A = LUP, with P a 
permutation matrix and AP^ = LU the LU decomposition of AP^. Allowing 
row and column permutations extends the LU decomposition to any matrix, 
without restriction on the rank profile. In the context of numerical com- 
putation, the row and column permutations leading to the best numerical 
stability are chosen (see for example Golub and Van Loan, 1996, §3.4), and 
are referred to as partial pivoting (e.g., column permutations only) and com- 
plete pivoting (column and row permutations). In the context of symbolic 
computation, a key invariant of the input matrix that should be revealed is 
the rank profile. In the next subsection we recall the most common matrix 
decompositions based on the column echelon form of A, thus all revealing 
the row rank profile of A. 

2.2. Rank profile revealing decompositions 

Let y4 be an m X -n. input matrix with row rank profile [io, ■ ■ ■ , ir-i]- Recall 
how the iterative version of the Gaussian elimination algorithm transforms 
A to column echelon form. The algorithm detects the row rank profile of A 
during the elimination. For i = iq, . . . ,ir-i, a pivoting step (interchange of 
two columns) is performed, if required, to ensure the pivot entry in row ij and 
column j is nonzero, and then entries to the right of the pivot are eliminated. 
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By recording the column swaps separately in a permutation matrix P, and 
the eliminations in a unit upper triangular matrix U, we arrive at a structured 
transformation of the input matrix to column echelon form AP^U. 

Example 2. The following shows the structured transformation of a 7 x 5 
matrix A with row rank profile [0, 3, 4] to column echelon form C. 



* * * * 

* * * * 

* * * * 

* * * * 

* * * * 

* * * * 

* * * * 



u 

"1 * * * 

1 * * * 
1 * * 

1 

1 



_ c 

* * * 

* * * 



Once an echelon form C has been obtained, post-multiplication by a 
diagonal matrix D can be used to make the pivot elements in the echelon 
form equal to 1, and a further post-multiplication by a unit lower triangular 
L can be used to eliminate entries before each pivot, giving the canonical 
reduced column echelon form R oi A. 

Example 3. The following shows the structured transformation of the ma- 
trix from Example [2] to reduced column echelon form. 



* * * 



U 

1 * 
1 



D 



R 



— 1 



1 

* 1 

* * 1 



Proposition 1. Let A be an m x n matrix with row rank profile [io, • • • , "^r-i]- 
Corresponding to any n x n permutation matrix P such that the submatrix 
of AP^ comprised of rows io . . . , V-i has generic rank profile, there exists 

• a unique n x n unit upper triangular matrix 



U 



such that C = AP^U is a column echelon form of A, and 



7 



a unique diagonal matrix D = Diag(Di, /„_,.) and n x n unit lower 
triangular 



such that R = AP^UDL is the reduced column echelon form of A. 

The literature contains a number of well known matrix decompositions 
that reveal the row rank profile: the common ingredient is a permutation 
matrix P that satisfies the requirements of Proposition [Tj The Turing de- 
composition of the transpose of A is as shown in Example [3] except with the 
matrices U, D, L, and P inverted and appearing on the right-hand side of 



the equation, and was introduced by Corless and Jeffrey (1997) as a general- 
ization to the rectangular case of the square decomposition A = LDU given 
by [Turiiigl in his seminal [1948| paper. The LSP and LQUP decompositions are 



due to Ibarra et al. (1982). A more compact variant of LQUP is QLUP, and the 
CUP decomposition is another variation of the LSP decomposition that we in- 
troduce here. The LQUP and QLUP decompositions also involve a permutation 
matrix Q such that the first r rows of QA are equal to rows io, . . . , V-i of 
A. Once the permutations P and Q satisfying these requirements are fixed, 
these five decompositions are uniquely defined and in a one-to-one corre- 
spondence. The following proposition links these decompositions by defining 
each of them in terms of the matrices U, C, D, L, and R of Proposition [T] 
For completeness, the proposition begins by recalling the definitions of the 
classic transformations of A to column echelon form and to reduced column 
echelon form. 

Proposition 2. Corresponding to an n x n permutation matrix P such that 
rows io, ■ ■ ■ , ir-i of AP^ have generic rank profile, let U, C, D, L and R be 
the matrices defined in Proposition [l} The following transformations exist 
and are uniquely defined based on the choice of P. 



ColEchTrans: (P, U, C) such that AP'^U 
Example: 



C. 



* * * 



U 

1 * * * *■ 
1 * * * 
1 * * 
1 

1 



_ C 

* 
* 

* * * 

* * * 
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RedColEchTrans: (P, X, R) such that AP'^X = R with X = UDL. 
Example: 



* * * 

* * * 

* * * 

* * * 

* * * 



R 



* * 

* * 

* * 

* * 

* * 



X 

* * * 1 * 
1 



Now let L — L = D ^ and U = U ^. The following decompositions 
exist and are uniquely defined based on the choice of P. 

• TuringDecomposition: {R, L, D, U, P) such that AP^ 
Example: 



RLDU. 



R 







■1 










* 










* 












1 












1 
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* 


* 






* 


* 


* 



D 



rl 1 




* 1 




* * 1 




1 




1. 







u 

1 * * * * 

1 * * * 
1 * * 

1 

1 



Example: 



AP"^ 



C 

* 

* * * 

* * * 





u 






rl * 


* 


* 




1 


* 


* 


* 




1 


* 


* 






1 










1 



LSP: (L', S, P) such that A = L'SP with 

— L' an m X m unit lower triangular with columns i^, . . . , i^-i equal 
to columns 0,...,r — lofCD, and other columns those of I^- 

— S an m X n semi upper triangular matrix with rows io, . . . , V-i 
equal to rows 0, . . . ,r — 1 of DU, and other rows zero. 
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Example: 



rl 

* 1 
* 
* 
* 

* 
* 



1 

* 1 

* * 1 

* * 



*i 



s 

* * 



* * 



In addition to P, let Q be an m x m permutation matrix such that rows 
0, . . . , r — 1 of QA are rows io, . . . , of A. Then the following decomposi- 
tions exist and are uniquely defined based on the choice of P and Q. 

• LQUP: {L',Q,U',P) such that A = L'QU'P, with L' the same matrix 
as in the LSP decomposition, and U' an m x n matrix with first r rows 
equal to the first r rows of DU, and last n — r rows zero. 
Example: 

u' 

5|C 



L' 



1 

* 1 

* * 1 

* * 1 



• QLUP: (Q, -P) with L" an m x r unit lower triangular matrix 

equal to the first r columns of Q^CD, and U" an r x n upper triangular 
matrix equal to the first r rows of DU. 
Example: 

ApT L 
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* 






■1 
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* 


* 


* 
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1 
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* 
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* 




* 
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* 


* 
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* 




* 


* 


* 


* 


* 


* 


* 


* 




* 


* 


* 



Note that the LSP, LQUP, and QLUP decompositions can be transformed 
from one to the other without any field operations. 
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3. The CUP matrix decomposition: algorithm and reductions 



In this section we propose an algorithm for the CUP decomposition, and 
show how each of the other four decompositions and two transformations of 
Proposition [2] can be recovered via the CUP decomposition. For clarity we 
postpone to Section |4] the discussion of the advantages, in terms of time and 
space complexities, of the CUP decomposition algorithm over other Gaussian 
elimination based agorithms in the literature. 

All algorithms presented in this section are recursive and operate on 
blocks; this groups arithmetic operations into large matrix multiplication 
updates of the form C = aAB + /3C. Reducing dense linear algebra to ma- 
trix multiplication not only leads to a reduced asymptotic time complexity 
because of the subcubic exponent of matrix multiplication, but also ensures 
good efficiency in practice thanks to the availability of highly optimized sub- 



routines (see for example Dumas et al. , 2002). 



3.1. Space-sharing storage and in-place computations 

Dealing with space complexity, we make the assumption that a field el- 
ement as well as indices use an atomic memory space. We distinguish two 
ways to reduce the memory consumption of our algorithms: 

Space-sharing storage. All the algorithms described here, except for matrix 
multiplication, store their matrix outputs (excluding permutations) in the 
space allocated to their inputs. For example, when solving the triangular 
system with matrix right-hand side UX = B, the input matrix B will be 
overwritten by the output matrix X = U~^B. In the case of matrix decom- 
positions, the output consists of one or two permutations that can be stored 
using an amount of space linear in the sum of the matrix dimensions, and 
two structured matrices that can be stored together within the space of the 
input matrix. We call this a space-sharing storage. 

The space-sharing storage for the CUP and ColEchTrans of a matrix of 
rank r stores the first r columns of the lower triangular factor below the first 
r rows of the upper triangular factor in the same matrix. Overlap is avoided 
by storing only the nontrivial diagonal entries. Here is an example for a CUP 
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decomposition. 



c 

* 

* * * 

* * * 



(7 

* * * *■ 
1 * * * 
1 * * 
1 

1 



* * * 

* * * 



The RedColEchelonTrans can be stored in a space-sharing manner up to 
some permutation. Indeed, any transformation to reduced column echelon 
form of rank r can be written as 



AP 



Xi X2 

In — T" 



R = Q 



r 

no 



for some permutation matrix Q. Up to this permutation of rows, this allows 
the space-sharing storage 

Xi X2 

.R . ' 

Similar space-sharing storage formats are possible for the QLUP, LSP, LQUP 
and TuringDecomposition. 

In-place computation. More importantly, we will also focus on the interme- 
diate memory allocations of the algorithms presented. As the algorithms rely 
heavily on matrix multiplication, one needs to take into account the memory 
allocations in the matrix multiplication algorithm. Using the classical O(n^) 
algorithm, one can perform the operation C ^ aAB + pC with no addi- 
tional memory space beyond that required to store the input and output, 
but this is no longer the case with Strassen and Winograd's 0(n^'^^) algo- 



rithms (for a detailed analysis see for example Huss-Lederman et al. [1996 



Boyer et al. 2009). Consequently, we will consider matrix multiplication as a 



black-box operation, and analyze the memory complexity of our algorithms 
independently of it. This leads us to introduce the following definition. 

Definition 1. A linear algebra algorithm is called in-place if it does not 
involve more than 0(1) extra memory allocations for field elements for any 
of its operations except possibly in the course of matrix multiplications. 

In particular, when classical matrix multiplication is used, in-place linear 
algebra algorithms only require a constant amount of extra memory alloca- 
tion. 
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3.2. Basic building blocks: MM, TRSM, TRMM, TRTRI, TRULM, TRLUM 

We assume that Algorithm [T] for matrix multiphcation is available. We 



Algorithm 1: m{A, B, C, a, (3) 
Data: A an m x i matrix. 
Data: B an £ x n matrix. 

Data: C an m x n matrix that does not overlap with either A or B. 
Data: a a scalar. 
Data: /3 a scalar. 
Ensures: C ^ aAB + (5C . 



also use the well-known routines TRSM and TRTRI, defined in the level 3 



BLAS legacy (Dongarra et al. 1990) and the LAPACK library (Anderson 



et al. , 1999 ). The TRSM routine simultaneously computes several linear system 



solutions X from an invertible triangular matrix A and a matrix right-hand 
side B. The matrix A can be lower or upper triangular, unit or non-unit 
triangular, left-looking [AX = B) or right-looking [XA = B). Algorithm [2] 
below illustrates the case "right-looking, upper, non-unit," and shows how to 



incorporate matrix multiplication; the algorithm is clearly in-place (Dumas] 
et al. , 2008, §4.1) and has running time 0{max{'m,n} n^~^). 



Algorithm 2: TRSM(Right, Up, NonUnit, U,B) 



Data: U an n x n invertible upper triangular matrix. 
Data: B an mx n matrix. 
Ensures: B ^ BU~^. 
1 begin 

it n = 1 then 
I for ? ^ 0, . . . , m — 1 do 6j q ^ ^i,o/'^^o,o 
else 

/* S = [Bi B2] with Bi m X k and U = 

TRSM(Right, Up, NonUnit, f/°•■fc^ 5, 
MM(fiO-i,t/^f,5,*^-" 



Ui V 
U2 



with Ui k X k 



M..k \ 
'0..m) 



TTk..n TDk..n i i\ 



*/ 
*/ 

/* B2- BiV */ 



/* Bi ^ BiU^ 



-1 



TRSM(Right, Up, NonUnit, f/|;;^, B^Z) 



/* B2^B2U^ 



-1 



*/ 
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We remark that an algorithm similar to Algorithm |2] can be written for 
the so-called TRMM routine, which computes the product of a triangular matrix 
by an arbitrary matrix, in the same eight variants (for the square case see 
for example Dumas et al. , 2008, §6.2.1). Clearly, TRMM has the same in-place 



and complexity features as TRSM. 

The TRTRI routine inverts a triangular matrix that can be either upper 
or lower triangular, unit or non-unit triangular. Algorithm [3] illustrates the 
case "upper, non-unit" and shares the following features with the three other 
variants: it reduces to matrix multiplication via two calls to TRSM in half the 
dimension, has a cost in 0{n'^), and is in-place. 



Algorithm 3: TRTRI (Up, NonUnit, U) 



Data: U an n x n invertible upper triangular matrix. 

Ensures: U ^ U~^. 

begin 

if n = 1 then 

I Mo,o ^ 1/^0,0 
else 



6 
7 
8 
9 
10 



/* u 



Ui V 

U2. 



with Ui k X k, s . t . U 



k..n\ 
k..n ' "-^O.-fe ) 
0..k jjk..n\ 



TRSM(Right, Up, NonUnit, f/, 
TRSM(Left, Up, NonUnit, U^-^ , ^ 

TTk,.n J Jjk..n 

^Q..k ^ '^O..k 

TRTRI ([/°-fc^) 
TRTRliUt:) 



va 



-1 



*/ 



/* V 

/* F^^i 
/* V ^-v */ 

/* Ui 

/* U2 



uv^v */ 



^c/r^ */ 



We conclude this section by introducing the TRULM and TRLUM routines: 
given a unit lower triangular matrix L and an upper triangular matrix U 
stored one next to the other within the same square matrix, they return the 
products UL and LU ^ respectively. Unlike TRSM and TRTRI, these routines 
are neither BLAS nor LAPACK routines, and to our knowledge have not 
yet been described elsewhere. Algorithms |4] and |5] show how they can be 
implemented in-place and at cost O(n^). The routine TRULM is the key step 
that enables us to compute the reduced echelon form in-place and in subcubic 
time (see Algorithm [s]), while TRLUM will be used to derive a fast and in-place 
method for matrix products of the form B A x B (see Algorithm^. 



14 



Algorithm 4: TRULM (A) 



Data: A -- 
Ensures: 
1 begin 



= [L\U] an n X n matrix. 
A ^ UL. 



if n > 1 then 

/* A = 



and LU 



Li\Ui U2 

TWLKiA^-i) 
MM(Ag-j^,AO;;^,<:^,l,l) 
TRMM(Right, Low, Vnit, A^-^, A'^:.k 
TRMM(Left, Up, NonUnit, yl^-" 



Xi X2 

X3 X4_ 



TRULM(Af-") 



n AO..k\ 



with Li,Ui,Xi kxk */ 

/* Xi^UiLi */ 
/* Xi^Xi + U2L2 */ 
/* X2 ^ U2L3 */ 
/* X3 ^ U3L2 */ 
/* X.i ^ U-:a,L?, */ 



Algorithm 5: TRLUM {A) 



Data: A 
Ensures: 
begin 

if n > 1 then 

A = 



[L\U] an n x n matrix. 
A^LU. 



TRLUM(A 

i0..fe 



m{A^-^ 



Li\Ui U2 
L2 L3,\U3 

k..n\ 
k..nJ 
Ak..n Ak..n 
^O..k^^ 



and UL 



Xi X2 
X3 X4 



k..n^ 1' 1) 

k /\k..n\ 
k-> -^O.-k) 

TRMM(Right, Up, NonUnit,AO -|,AO -^) 
TRLUM(>10::|) 



TRMM(Left, Low, Unit, A^-^) 



with Li,Ui,Xi kxk 



*/ 



/* ^ Lgf/g 

/* ^4^X4 + L2U2 */ 

/* X2 ^ L1U2 */ 

/* X2,^L2Ui */ 

/* Xi^LiUi */ 
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3.3. The CUP matrix decomposition algorithm 

We now present Algorithm |6j a block recursive algorithm for factoring 
any m x n matrix A as A = CUP with C a column echelon form reveal- 
ing the row rank profile of A, f/ a unit upper triangular matrix, and P a 
permutation matrix. It is a variation on the LSP and LQUP decomposition 



algorithms of Ibarra et al. (1982), similar to the ones presented by Jean- 



nerod 


(2006 


) and 


Dumas et al. 


(2008 



of Section 3.2, the description using submatrix indices shows that the entire 
algorithm can be performed in-place: each recursive call transforms a block 
into a block of the form [C\f/], and the remaining operations are a TRSM, a 
matrix multiplication, and applying row and column permutations. The CUP 
algorithm also handles the case where m > n. The ability to work in-place 
and handle matrices of arbitrary shape are the main advantages of the algo- 
rithm for CUP decomposition over the LSP or LQUP of Ibarra et al. (1982), as 
will be discussed in Section HI 





Vi 




A21 


A22 



\f/i 


Vi 




G 


A22 











G 


\U2 
C2\\ 


V2 








Figure 1: Illustration of Algorithm |6] steps [T2| [Tsl [24l and [27 



Figure [T] shows the state of the input matrix after the four main steps 
(steps 12, 18, 24, and 27). At step 27, the zero rows between blocks Vi and 
[U2 V2] are shifted down. 

Proposition 3. Algorithm [6] computes a CUP decomposition of the input 
matrix A together with its rank r and row rank profile [iq, . . . , V-i]- 



Proof. It is clear from the description of the algorithm that P is a per- 
mutation matrix, U is unit upper triangular, and C is in column echelon 
form. To check that A = CUP, one can proceed exactly as Jeannerod (2006 
§3.1). □ 
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Algorithm 6: CUP(A) 



Data: A an m x n matrix. 
Result: (r, [io, . . . ,v_i],P). 

Ensures: A ^ [C\U] such that, embedding U in an n x n unit upper 
triangular matrix makes A = CUP a CUP decomposition of 
A, and r and [io, . . . , ir-i] are the rank and the row rank 
profile of A. 



1 begin 

2 



if m = 1 then 
ii A^O then 

i •<— column index of the first nonzero entry of A 
P ■<— To.i, the transposition of indices and i 

A^ AP 

for i ^ 2 . . .n do Ao,i Ao^iA^g 
return (1, [0],P) 

return (0, [],/„) 

Ai 



k ^ [fj /* A 

(ri,[2o,...,Vi-i],A)^CUP«-) 
if ri = then 

(r2,h,...,^.,-i],P2)^CUP(A0;-) 
return (ra, [io + fc, . . . , i^^-i + k], P2] 

P[ /* [A 





^2 



with A-i kxn */ 
Ci [C/i Fi] Pi */ 



/10..J1 , AO 
k..m k..m 
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/* A 



A21 A22 

TRSM(Right, Upper, Unit, A'^^^I^AI^) 
if ri = n then 

[_ return (ri, [io, Pi) 

1,1) 

i],P2)^cupK!-) 



A22] ^ [A21 A22] Pf */ 

with Ci\Ui k X ri and Vi n x (n — ri) */ 

/* G^^2ii7r^ */ 



mAi2,A'-i,Ai]:: 



i0..ri 

{r2, [io, 



, Jr2- 
..n 
ri' 

for J ■<— ri . . . ri + r2 do 



Ari..n . Ar\..n pT 
^O..n ^O..ri -'2 



/* H^A22-GVi */ 
/* H^C2U2P2 */ 

/* Vi ^ FiPj */ 



Aj+k-ri 



^ ^n..j+fe-r-i 



/* Moving U2,V2 up next to Vi */ 



^[0] 

P^Diag(7,,,P2)Pi 
return (ri + r2, [io, ■ 



,«ri-i,io + A;, . . . , >2_i + k],P) 
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3.4- ColEchTrans trans ormation via a CUP decomposition 

Proposition |4] states how to compute the required transformation matrix 
X such that AX = C is in echelon form. 

Proposition 4. Let A = CUP be the CUP decomposition of an m x n 
matrix A of rank r computed with Algorithm |6} Let X be the inverse of U 
embedded in an x n unit upper triangular matrix. Then AP^X = E is a. 
transformation of A to column echelon form. 

Proof. One need only verify that CX^^ = CUP = A. □ 



Algorithm 7: ColEchTrans(y4) 



Data: A an m x n matrix over a field. 

Result: (r, [Iq, . . . , P) where P is a permutation matrix. 

Ensures: A ^ [C\X] such that, embedding X in an n x n unit upper 
triangular matrix makes AP^X = C a transformation of A 
to column echelon form, r = rank(y4) and [zq, . . . , ir-i] is the 
row rank profile of A. 



1 begin 

2 



(r, [zo,...,z._i],P) ^CUP(A) 
/* A= C\Ui ^ where Ui is r x r upper triangular */ 
TRSM(Left, Upper, Unit, Ag:;, AS:,;!) /* M ^ f/f */ 



A 



0..r 



_ Ar..n 
^Q..r 



TRTRI (Upper, 



0..rJ 



/* N i M */ 

/* u^u-^ */ 



return (r, [io, . . . , V-i],P) 



Algorithm [7| shows how the computation of the column echelon form can be 
done in-place and reduces to that of CUP, TRSM and TRTRI and therefore has 
a time complexity of 0{mnr'^^'^). 

3.5. RedColEchTrans transformation via CUP decomposition 

Proposition [5] states how the reduced column echelon form can be com- 
puted from the CUP decomposition. Recall that Tij denotes the permutation 
matrix that swaps indices i and j and leaves the other elements unchanged. 



18 



Proposition 5. Let (r, [iq, 



-1 ' 



P) and [C\Y] be the output of Algo- 



rithm [7] called on an m X n matrix A. As previously, assume that the 
matrix Y is embedded in an n x n unit upper triangular matrix. Let 

Q = To^i^Ti^i^ . . .Tr-i,i^_^ so that QC = 

singular lower triangular matrix. Let 



where Li is an r x r non- 



R = C 



and X = F 



Then AP^X = R is a transformation of A to reduced column echelon form. 



Proof. We start by showing that R is in reduced column echelon form. 
Recall that the permutation matrix Q is such that row of a matrix A is 
row ik of Q^A. Hence for j G {0 ... r — 1}, row ij of R is row j of the identity 
matrix Im- 

Consider column j of R. From the above remark, it has a 1 in row 
ij. We now show that any coefficient above it is zero. Let i < ij, then 
Rij = C^i,k[^~^]k,j- As C is in column echelon form and i < ij, Ck,i = 

for any k > j. Since is lower triangular, L^^ = for any k < j, hence 
Ri,j = 0. 

It remains to verify 



AX = CUPP^Y 



LI 



-1 



R. 



□ 



Algorithm |8] shows how the computation of the reduced column echelon form 
transformation can be done in-place and reduces to that of CUP, TRSM, TRTRI, 
and TRULM and therefore has a time complexity of 0{mnr'^^'^). 



4. Discussion 



Because it is the building block of dense linear algebra algorithms, Gaus- 
sian elimination appears in many different forms in the literature and in 
software implementations. Softwares are mostly based on either a transfor- 
mation to echelon form, or one of the decompositions of Proposition [2| from 
which the usual computations such as linear system solving, determinant, 
rank, rank profile, nuUspace basis, etc. can be derived. We discuss here the 
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Algorithm 8: RedColEchTrans(A) 



Data: A an m x n matrix over a field. 

Result: (r, [ig, • • • , ir-i],P) where P is a permutation matrix. 



Ensures: A ^ 



Xi X2 
R2 



such that if X 



Xi X2 



T 



n—r 

L 



R,0 



and 
then 



Q — ^o,io^i,ii • • • ^r-i,v_i and R — Q 
AP^X = i? is a transformation of A to reduced column 
echelon form, r = rank(yl), and [29, • • • , V-i] is the column 
rank profile of A. 



1 begin 



(r, [if), . . . , ir-i], P) ^ ColEchTraiis(yl) 

/* Notations: A = [C\Ti T2] and X2 ^ T2 . 

for j ^ . . . r — 1 do 



L2 



*/ 



^QC */ 



TRSM(Right, Lower, Unit, A%-J^, 

TRTRI (Lower, 

TRULM(/lO:;) 

return (r, [zq, . . . , V-i],P) 



/* R2 ^ L2L\^ */ 
/* N ^L\^ */ 
/* Xi ^ TiiV */ 



advantages of the CUP decomposition algorithm presented in the previous 
section compared to some other algorithms in the literature. 

In our comparison we only consider algorithms that reduce to matrix 
multiplication. Indeed, reduction to matrix multiplication is the only way to 
benefit from both the best theoretical complexity and the practical efficiency 



of highly optimized implementations of the level 3 BLAS routines (Dumas 



et al., 2008). 



To our knowledge, the algorithms achieving subcubic time complexity for 
elimination of rank deficient matrices are 



the LSP and LQUP algorithms of Ibarra et al. ( 1982 ) for computing an 
LSP and LQUP decomposition. 



the Gauss and GaussJordan algorithms (Storjohann, 2000, Algorithms 



2.8 and 2.9) for computing an echelon and reduced echelon form trans- 
forms, and 
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the StepForm algorithm (Keller-Gehrig 1985 Biirgisser et aL 1997 



§16.5) for computing an echelon form. 
For the sake of completeness, we recall the Gauss Jordan and StepForm al- 



gorithms in Appendix A and Appendix B , together with an analysis of their 



complexity. Note that already in his seminal paper, Strassen ( 1969 ) proposed 
an 0{n^) algorithm to compute the inverse of a matrix under strong gener- 
icity assumptions on the matrix. The Gauss Jordan algorithm can be viewed 
as a generalization: it inverts any nonsingular matrix without assumption, 
and computes the reduced echelon form and a transformation matrix in the 
case of singular matrices. 

Note that all references to QLUP decomposition that we know of (namely 



(Golub and Van Loan, 1996, §3.4.9) and (Jeffrey, 2010)) do not mention any 
subcubic time algorithm, but as it can be derived as a slight modification of 
Algorithm [6} we will also take this variant into account for our comparison. 

Our comparison of the various algorithms will be based on both space 
complexity (checking whether the algorithms are in-place) and a finer analysis 
of the time complexity which considers the constant of the leading term in 
the asymptotic estimate. 



4.I. Comparison of the CUP decomposition with LSP, LQUP, QLUP 

LSP. As discussed in Section 3A, the LSP decomposition can not enjoy a 
space-sharing storage, unless its L matrix is compressed by a column permu- 
tatin Q. Now when designing a block recursive algorithm similar to Algo- 
rithm [6} one needs to compress the S matrix returned by the first recursive 
call, in order to use a TRSM update similar to operation [l8j This requires to 
either allocate an r x r temporary matrix, or to do the compression by swap- 
ping rows of 5* back and forth. Instead, if the upper triangular matrix is kept 
compressed during the whole algorithm, this becomes an LQUP decomposition 
algorithm. 

LQUP. Although LQUP decomposition can use a space sharing storage, the 
intermediate steps of a block recursive algorithm derived from Algorithm [6] 
would require additional column permutations on the L matrix to give it 
the uncompressed shape. Instead, if one chooses to compute the LQUP de- 
composition with a compressed L matrix, this really corresponds to the CUP 
decomposition, up to row and column scaling by the pivots. 
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QLUP. The QLUP decomposition can also be computed by an adaptation of 
Algorithm |6] where rows of the lower triangular matrix have to be permuted. 
Such an algorithm for the QLUP decomposition would then share the same ad- 
vantages as the CUP decomposition algorithm but the following three reasons 
make the CUP decomposition preferable. 

• The overhead of row permutations on the lower part of the matrix 
might become costly especially with sparse matrices. 

• Part of the structure of the matrix C is lost when considering L, Q 
instead: in C, any coefficient above a pivot, in a non-pivot row is known 
to be zero by the echelon structure, whereas this same coefficient in L 
has to be treated as any other coefficient, and be assigned the zero 
value. 

• It seems difficult to implement an efficient storage for the permutation 
Q (as can be done for P, using LAPACK storage of permutations). One 
could think of setting Q = T^^^^Ti^i^ . . . Tr-i^i^_^ after the algorithm has 
completed, as it is done for Algorithm [8j However this permutation 
does not correspond to the permutation that was applied to the non- 
pivot rows of L during the process of the algorithm (call it Q). We could 
not find any subquadratic time algorithm to generate this permutation 
Q from the two permutations Qi and Q2 returned by the recursive calls. 

The four matrix decompositions LSP, LQUP, PLUQ, and CUP are mathemat- 
ically equivalent: they can all be computed by a dedicated algorithm with 
the same amount of arithemtic operations, and any conversion from one to 
another only involves permutations and pivot scaling. Among them, the 
CUP decomposition stands out as the most natural and appropriate one from 
the computational point of view: it can use a space-sharing storage and be 
computed in place and with the least amount of permutations. 

4-2. A rank sensitive time complexity 

As one may expect, the rank of the input matrix affects the time complex- 
ity of the CUP algorithm. For example, using a naive cubic-time algorithm for 
matrix multiplication the CUP decomposition requires 2mnr — (?T, + m)r^ + 
field operations for an m x input matrix of rank r. 

Assuming a subcubic algorithm for matrix multiplication, the analysis in 
the literature for most Gaussian elimination algorithms is not rank sensitive. 
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For example, the running time of the LSP and LQUP algorithms (Ibarra et al. 
1982) is only shown to be 0{m'^~^n), assuming m < n. Following the analy- 



sis of the Gauss Jordan algorithm (Storjohann, 2000, Algorithm 2.8), we give 



in Proposition |6] a rank sensitive complexity for Algorithm |6] computing the 
CUP decomposition of an input matrix of arbitrary shape. According to the 
reductions of Section |3| the rank sensitive complexity bound of Proposition [6] 
also holds for the computation of all other decompositions and transforma- 
tions of Proposition [2j 

Proposition 6. Algorithm [6] computes a CUP decomposition of an m x n 
matrix of rank r using 0{mnr'^~'^) field operations. 



Proof. Denote by Tcup(^«,n, r) be the number of field operations required 
by Algorithm |6] for an m x n matrix A of rank r. 

In the following, we assume without loss of generality that m is a power 



of 2. Following Storjohann (2000) we count a comparison with zero as a 



field operation. Then, when r = (that is, A is the zero matrix), we have 
T(m, n, r) = 0{mn). As in the algorithm, let ri be the rank of Ai and let r2 
be the rank of H . Then 



Tcup(m,n, r) 



{0{n 
Tcup 
Tcup 
Tcup 

10( 



2 ' 



2 ' 
m 



o) + rc„p( 

n,n) + 0(f 



m 
2 ' 

m 



n, + 0{mn) 



if m 
if Ti 
if ri 



1, 
0, 
n. 



2^1 



n,rij + Tcup(y,n - ri,r2 
0(f (n-nX"' 



) + 



if < ri < n. 



Consider the ith recursive level: the matrix is split row-wise into 2* slices 
of row dimension m/2*. We denote by r^*"* the rank of each of these slices. 



indexed by j = ... 2* — 1. For example rf* = rank(y4Q 

At the ith recursive level the total number of field operations is in 



l0..n 



j=0 

Since w > 2 we have 



'2j+l 



2i -X^ ('^SjM-l 



j=0 



LU-2 



3=0 



, .. N OJ— 1 
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Now, using the fact that a"^'^ + h'^-'^ < 2^-'^{a + h^-'^ for 2 < w < 3, we 
also have 

Therefore, we obtain 

log2 m 

m 



1=1 




2* 

which for > 2 is in 0{mnr^~'^). □ 



We refer to Appendix B for a discussion on why the StepForm algorithm 



does not have a rank sensitive time complexity. 

Ji^.S. Space complexity 

In the presentations of Algorithms [2]— [6] we exhibited the fact that no 
temporary storage was used. Consequently all of these algorithms, as well as 
RedEchelon (Algorithm [s]), work in-place as per Definition [l} For a square 
and nonsingular input matrix, RedEchelon thus gives an in-place algorithm 
to compute the inverse. 

For comparison, the Gauss Jordan algorithm involves products of the type 



C ^ A X C (Algorithm 11, lines 17 and 23), which requires a copy of the 



input matrix C into a temporary storage in order to use a usual matrix 



multiplication algorithm. These matrix multiplication in lines 17 and 23 
could be done in-place using Algorithm |9| but the leading constant in the 
time complexity of this version of Gauss Jordan increases to 3 + 1/3 from 2 
for w = 3. 



We refer to Appendix B for a discussion on why the StepForm algorithm 
is not in-place. 

4.4- Leading constants 

For a finer comparison we compute the constant of the leading term in the 
complexities of all algorithms presented previously. For simplicity we assume 
that m = n = r (i.e., the input matrix is square and nonsingular). The com- 
plexity of each algorithm is then of the form K^^n'^ + o{n^) for some leading 
constant that is a function of the particular exponent u and correspond- 
ing constant such that two n x n matrices can be multiplied together in 
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Algorithm 9: InPlaceMM(A, B) 



Data: A an m x m matrix over a field. 
Data: B a.n m x n matrix over a field. 
Ensures: B ^ A x B. 

1 begin 

2 (r, [^0, . . . , V-i], P) ^ CUP(A) /* [C\U] */ 

3 B ^ PB 

4 TRMM(Left, Up, Unit, A, B) /* B ^ UB */ 

5 TRMM(Left,Low,NonUnit, l*B^CB^I 

6 TRLUM(v4) /* ^ ^ — CU */ 

7 A ^ — AP 



time Cojn'^ + o(n'^). To find the leading constant of an algorithm we sub- 
stituted T(n) = K^^n^ into the recurrence relation for the running time. The 
results are summarized in Table [T| which also gives the numerical values of 
K^^ for the choices [lo^CS] = (3,2) and (a;,C^) = (log2 7, 6), corresponding 
respectively to classical matrix multiplication and to Strassen-Winograd's 
algorithm (Winograd, 1971). Table [l] shows how the various rank-profile re- 



vealing elimination algorithms range in terms of time complexity: CUP is in 
2/3n^ + o(n^), transformation to echelon form is in + o{n^), and transfor- 
mation to reduced echelon form is in 2n^ + o(n^). 

Figure |2] summarizes the comparison between the three approaches 

• CUP ColEchTrans RedColEchTrans, 

• Gauss, and 

• GaussJordan 

with respect to their application to solving various classical linear algebra 
problems. 

The following observations can be made: 

• Algorithm CUP is sufficient (i.e., the best choice in terms of time com- 
plexity) for computing the determinant, the rank, the rank profile, and 
the solution of a linear system: all of these invariants can be computed 
in time K^^rt^ + o(n'^) where K^^ is the leading constant for Algorithm 
CUP, for example K^^ = 2/3 in the case where u = 3. 
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• Computing a transformation to echelon form (and thus a basis of the 
nullspace) is done by Algorithms ColEchTrans and Gauss with the 
same time complexity. In particular, as indicated in Table [T| the leading 
coefficient K^^ in the complexities of these algorithms is the same. 

• Computing a transformation to reduced echelon form can be done us- 
ing Algorithms RedColEchTrans or Gaussjordan. Morever, since the 
reduced echelon form of a nonsingular matrix is the identity matrix 
and the corresponding transformation matrix is the inverse, these al- 
gorithms are also algorithms for matrix inversion. The two algorithms 
have the same leading coefficient in the time complexity for w = 3 
(namely, 2n^), but for uj = log2 7 Algorithm Gaussjordan is 10% faster. 



CUP 



2/3 



Gauss 



Det 
Rank 

RankProfile 
Solve 



u 



TRTRI 



1/3 

CUP decomp. 
A=CUP 




Echelon 

Nullspace basis 



TRTRI 




TRTRM 


1/3 


c \ 


2/3 



Echelon form transform. 
AP"fU-i=C 



- Reduced 
Echelon form 
transform 

- Inverse 



Reduced Echelon 
form transform 
APTX=R 



Gaussjordan 



Figure 2: Application of rank revealing Gaussian elimination to linear algebra problems. 

We conclude that the CUP decomposition is the algorithm of choice to imple- 
ment rank profile revealing Gaussian elimination: it allows to compute all 
solutions in the best time complexity (except the case of the inverse with 
u = log 7, where Algorithm Gaussjordan is faster), either directly or using 
additional side computations. However, note that Gaussjordan is not in- 
place; applying the technique of Algorithm |9] to make Gaussjordan in-place 
increases the constant from 8 to about 21 in the case where u = logg 7. 
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5. Row echelon form and the PLE decomposition 



All the algorithms and decompositions of the previous sections deal with 
column echelon forms that reveal the row rank profile of the input matrix. 
By matrix transposition, similar results can be obtained for row echelon 
forms and column rank profile. The natural analogue to our central tool, 
the CUP decomposition, is the PLE decomposition: a tuple (P, L, E) such that 
{E^, L^, P^) is a CUP decomposition of the transposed input matrix A^: 



L 

■1 

* 1 

* * 1 

* * * 1 

* * * 1 



E 

*3 * * 



To compute a PLE decomposition one could apply Algorithm [6] with pre- and 
post-transpositions. A better option is to derive a "transposed" version of 
Algorithm [HI which directly computes a PLE decomposition in-place; as Al- 



gorithm 10 shows, this is immediate to achieve, and is what is implemented 



in the M4RI library (Albrecht and Bard, 2010) for dense linear algebra over 



GF(2). The compact rowmajor storage together with the applications driv- 
ing its development (namely Grobner basis computations f Faugere 1999)) 
imposed the row echelon as a standard. The FFLAS-FFPACK (The FFLAS- 



FFPACK Group 2011) library for dense linear algebra over word-size finite 



fields implements both the CUP and PLE Algorithms. 
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6. Conclusion 



We have presented an algorithm computing the rank profile revealing CUP 
decomposition of a matrix over a field. We have shown that the algorithm 
enjoys the following features: 

1. The algorithm reduces to matrix- matrix multiplications and therefore 
has a subcubic time complexity. 

2. The complexity is rank sensitive of the form 0{mnr^^'^), where r is the 
rank of the m x n input matrix. 

3. The algorithm is in-place, that is, only 0(1) extra memory allocation 
for field elements is required beyond what is needed for the matrix 
products. 

4. Used as a building block for most common computations in linear alge- 
bra, the algorithm achieves the best constant in the leading term of the 
time complexity. The only exception is for the reduced echelon form 
with u = log2 7, where the constant is 8.8 instead of 8 for Algorithm 
GaussJordan. 

Among the set of Gaussian elimination algorithms studied here, the algorithm 
for CUP decomposition is the only one satisfying all the above conditions. 
For these reasons it has been chosen for the implementation of Gaussian 
elimination in the FFLAS-FFPACK and LinBox libraries for dense matrices 



over word-size finite fields (The FFLAS-FFPACK Group, 2011 The LinBox 



Group, 2011), as well as for the M4RI and M4RIE libraries for dense matrices 



2011 Albrecht et al. 



over GF(2) and GF( 2^), r espectively ( [Albrecht and Bard| |2010[ [Albrecht 

[20ll| ). 



Appendix A. The GaussJordan algorithm 



The GaussJordan algorithm (Storjohann, 2000, Algorithm 2.8) was origi- 
nally presented as a fraction free algorithm for transforming a matrix over an 
integral domain to reduced row echelon form. We give here a simpler version 
of the algorithm over a field. The principle is to transform the matrix to 
reduced row echelon form, slice by slice. The general recursive step reduces 
a contiguous slice of columns of width w starting at column k by 

1. recursively reducing the first half of the slice of columns (of width w/2), 

2. updating the second half of the slice of columns (also of width w/2), 

3. recursively reducing the second half. 
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Algorithm 10: PLE{A) 



Data: A an m x n matrix. 
Result: {r,[io, . . . ,ir-i], P)- 

Ensures: A ^ [L\E] such that embedding L in an m x m unit lower 
triangular matrix makes A = PLE a PLE decomposition of 
A, r = rank(y4) and [io, . . . , ir-i] its column rank profile. 

1 begin 

if n — 1 then 
if A ^ then 

i ^ row index of the first nonzero entry of A 
P -ir- To^i, the transposition of indices and i 
A^ PA 

for i 2 . . . m do Ai^ ^i,o^o,o 
return (1, [0],P) 

return (0, [],/„) 

(ri,[io,...,V-i],Pi)^PLE«:^) 
if n = then 



/* A^l 
/* Ai ^ Pi 



AM2 
Ml 



*/ 

El */ 



{r2, [io, ■ 



_i],P2) ^PLE(A 



k..n \ 
0..rn) 



return (r2, [io + k,..., + k], P2) 



Ak..n , pT /\k..n 

^O..m ^ -'1 ^O..m 

/* A 



Li\Ei A12 
Ml A22 
TRSM(Left, Low, Unit, A^^-'^l, A^^^J 
if ri = m then 
[_ return (n, [io, ir,-i], Pi) 

{r2,[jo,...,Jr,-i],P2)^PLEiA';---J 



AO..ri , pTAO..ri 
r\..m ~ 2 r\..m 



with Li\Ei ri X k, and Mi {n — k) x ri */ 

/* G L^^Ai2 */ 



/* H^A22-MiG */ 
/* A^i ^ P2M1 */ 



for j ■<— ri . . . ri + r2 do 



A3 
j..m 

Ai^l 



Aj+k-ri 
j..m 



/* Moving L2 left next to Mi */ 



-n 



j..m ^ [0] 



P^PiDiag(/,,,P2) 
return (ri + r-y. [io- 



■h;-i-]o + k jr,-i + k].P) 
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4. composing the two transformation matrices. 



The algorithm is described in full details in Algorithm [TT] Calling Gauss- 
Jordan(74, 0, 0, n) computes the reduced row echelon form and the associated 
transformation matrix. Once again, the presentation based on the indexing 
of the submatrices shows where all matrices are located in order to illustrate 
the need for extra memory allocation. 

Assuming m = n = r, the time complexity satisfies: 

Tcauss Jordan (?^, w) = 2TGaussJordan(^, w/2) + 2Tmiw/2, w/2, u) 

Substituting the general form T{n,w) = K^nw^~^ into this recurrence 
relation, we obtain = ^i^-i-i - 

Appendix B. The StepForm algorithm 

The StepForm algorithm, due to Schonhage and Keller-Gehrig, is de- 



scribed in (Biirgisser et al. , 1997, §16.5). The algorithm proceeds by first 
transforming the input matrix into an upper triangular matrix, and then re- 
ducing it to echelon form. Thus, the rank profile of the matrix only appears 
at the last step of the algorithm and the block decomposition used for the 
first step is not rank sensitive. As a consequence, the complexity of the algo- 
rithm does not depend on the rank of the matrix. Note that this limitation 
is not because of simplifying assumptions in the analysis of the complexity, 
but because the algorithm is itself not rank-sensitive. 

We now evaluate the constant of the leading term in the complexity of 
the StepForm algorithm under the simplifying asumption m = n = r. The 
description of Algorithms 111,112, and II3 used as sub-phases can be found 



in (Biirgisser et al. , 1997, §16.5) and the references therein. 

Let Ti(n) be the complexity of Algorithm IIi applied to a 2n x n matrix, 
and let T2(n) be the complexity of Algorithm 112 applied to an n x n matrix. 



We have 



Tl Ti Tl Tl 

T,{n) = 4Ti(-) + 4MM(-, ^,^)+ MM(n, n, n), 



from which we deduce 



90^-2 _|_ 1 
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Algorithm 11: GaussJordan(74, A;, s, w) 



Data: A an m x n matrix over a field. 

Data: k, s the column, row of the left-top coefficient of the block to reduce. 
Data: w the width of the block to reduce. 

Result: (r, P, Q), where P, Q are permutations matrices of order m and w. 



Ensures: r = rank(A^ ;^+'") and A 



p Ak..k+w 

^^O..m — 



AO..k 
^O..m 





F 




X2 
X3 


G 



A k-\-w..n 



such that 



F 
Ir G 




Q 



IsXi 

X2 

X3 Im- 

begin 

if w = 1 then 

if Ai,^ ^ [0] then 

j the row index of the first nonzero entry of A^ , 
P ^ Tsj the transposition of indices s and j 
A^ PA 
return (1,P, [1]) 

else 

|_ return (O,/^, [1]) 



else 



h ^ [w/2\ 

(ri, Pi, Qi) = GaussJordan(A. k, s, h) 



/* Notations: A = 

s + ri; g ^ k + h 
MM - 





Xi 


F 


Yi 




* 


X2 


E 


Y2 






X3 




Y3 





where X2 is ri x n . 



*/ 



Ag..k+w 
^O..s • 



temp ^ 



\ temp, Af.-.r"', 1,0) 



^s..t 

MM 

MM(4l+'^s^l-;r"',^f;;r'",i,i) 

(^2, P2, Q2) = GaussJordan(A, g, t, w 



g..k+w 

.k+w 



/* top left indices of Y3 */ 

1, 1) /* Yi^Yi+ X1Y2 */ 

/* a temporeiry is needed for Y2 */ 

/* Y2 ^ X2Y2 */ 

/* Y3^Y3+ X3Y2 */ 



h) 



/* Notations: A 



MM , , , i, i) 





Xi 


El 


Zi Fi 






X2 


E2 






* 


xi 






* 






z\ 





with Z3 72 X 72 



/* 







\XA 




\Z{\ 


X2 




X2 


+ 


Z2 



*/ 



iik..k+r\ 
^t..t+r2 

■g+r2 



temp 

MM (Af;.f , temp, ,1,0) 



MM A'^--k+ri Ak.-k+ri x 



/* a temporary is needed for X^ */ 

/* X'^ ^ Z^X'^ */ 
/* X'^^X'^ + ZiXl, */ 



ri 



^W — h — T2- 

return (n + r2, P2P1, Q) 



Qi 



Q2 
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We have 



from which we deduce 

T2in) = a (^^j:^^ + (2.-1 _ 1X2-2 _!)) 

Under our assumption n = m = r, Algorithm 11^ does not perform any 
operation and the total complexity is therefore T2(n). For {u!,Cu;) = (3,2) 
we obtain T2{ri) — 4n^, and for (a;, C^^) = (log2 7, 6) we obtain T2{n) — 
^n'°g2 7 = 15.2^2-81. 

5 

Finally, the algorithm does not specify that the transformation matrix 
generated by algorithm 112 is lower triangular: it needs to be stored in some 
additional memory space and thus the algorithm is not in-place. 
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